nlp_architect.data package

Subpackages

Submodules

nlp_architect.data.amazon_reviews module

class nlp_architect.data.amazon_reviews.Amazon_Reviews(review_file, run_balance=True)[source]

Bases: object

Takes the *.json file of Amazon reviews as downloaded from http://jmcauley.ucsd.edu/data/amazon/, then performs data cleaning and balancing and transforms the 1-5 review scores into a sentiment label.

process()[source]
nlp_architect.data.amazon_reviews.review_to_sentiment(review)[source]
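
A minimal usage sketch (the review file name below is a placeholder; only the constructor arguments and process() documented above are assumed):

    from nlp_architect.data.amazon_reviews import Amazon_Reviews

    # 'reviews_Movies_and_TV_5.json' is a placeholder for a file downloaded from
    # http://jmcauley.ucsd.edu/data/amazon/
    data = Amazon_Reviews('reviews_Movies_and_TV_5.json', run_balance=True)
    data.process()  # cleans and balances the data, mapping 1-5 ratings to sentiment labels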

nlp_architect.data.babi_dialog module

class nlp_architect.data.babi_dialog.BABI_Dialog(path='.', task=1, oov=False, use_match_type=False, use_time=True, use_speaker_tag=True, cache_match_type=False, cache_vectorized=False)[source]

Bases: object

This class loads in the Facebook bAbI goal oriented dialog dataset and vectorizes them into user utterances, bot utterances, and answers.

As described in: “Learning End-to-End Goal Oriented Dialog”. https://arxiv.org/abs/1605.07683.

For a particular task, the class will read both train and test files and combine the vocabulary.

Parameters:
  • path (str) – Directory to store the dataset
  • task (str) – a particular task to solve (all bAbI tasks are trained and tested separately)
  • oov (bool, optional) – Load test set with out of vocabulary entity words
  • use_match_type (bool, optional) – Flag to use match-type features
  • use_time (bool, optional) – Add time words to each memory, encoding when the memory was formed
  • use_speaker_tag (bool, optional) – Add speaker words to each memory (<BOT> or <USER>) indicating who spoke each memory.
  • cache_match_type (bool, optional) – Flag to save match-type features after processing
  • cache_vectorized (bool, optional) – Flag to save all vectorized data after processing
data_dict

Dictionary containing final vectorized train, val, and test datasets

Type:dict
cands

Vectorized array of potential candidate answers, encoded as integers, as returned by the BABI_Dialog class. Shape = [num_cands, max_cand_length]

Type:np.array
num_cands

Number of potential candidate answers.

Type:int
max_cand_len

Maximum length of a candidate answer sentence in number of words.

Type:int
memory_size

Maximum number of sentences to keep in memory at any given time.

Type:int
max_utt_len

Maximum length of any given sentence / user utterance

Type:int
vocab_size

Number of unique words in the vocabulary + 2 (0 is reserved for a padding symbol, and 1 is reserved for OOV)

Type:int
use_match_type

Flag to use match-type features

Type:bool, optional
kb_ents_to_type

For use with match-type features, dictionary of entities found in the dataset mapping to their associated match-type

Type:dict, optional
kb_ents_to_cand_idxs

For use with match-type features, dictionary mapping from each entity in the knowledge base to the set of indices in the candidate_answers array that contain that entity.

Type:dict, optional
match_type_idxs

For use with match-type features, dictionary mapping from match-type to the associated fixed index of the candidate vector which indicates this match type.

Type:dict, optional
static clean_cands(cand)[source]

Remove leading line number and final newline from candidate answer

compute_statistics()[source]

Compute vocab, word index, and max length of stories and queries.

create_cands_mat(data_split, cache_match_type)[source]

Add match-type features to candidate answers for each example in the dataset. Caches once complete.

create_match_maps()[source]

Create a dictionary mapping from each entity in the knowledge base to the set of indices in the candidate_answers array that contain that entity. Used for quickly adding the match-type features to the candidate answers during fprop.

encode_match_feats()[source]

Replace entity names and match type names with indexes

get_vocab(dialog)[source]

Compute vocabulary from the set of dialogs.

load_candidate_answers()[source]

Load candidate answers from file, compute number, and store for final softmax

load_data()[source]

Fetch and extract the Facebook bAbI-dialog dataset if not already downloaded.

Returns:training and test filenames are returned
Return type:tuple
load_kb()[source]

Load knowledge base from file, parse into entities and types

one_hot_vector(answer)[source]

Create one-hot representation of an answer.

Parameters:answer (string) – The word answer.
Returns:One-hot representation of answer.
Return type:list
static parse_dialog(fn, use_time=True, use_speaker_tag=True)[source]

Given a dialog file, parse into user and bot utterances, adding time and speaker tags.

Parameters:
  • fn (str) – Filename to parse
  • use_time (bool, optional) – Flag to append ‘time-words’ to the end of each utterance
  • use_speaker_tag (bool, optional) – Flag to append tags specifying the speaker to each utterance.
process_interactive(line_in, context, response, db_results, time_feat)[source]

Parse a given user’s input into the same format as training, build the memory from the given context and previous response, update the context.

vectorize_cands(data)[source]

Convert candidate answer word data into vectors.

If sentence length < max_cand_len it is padded with 0’s

Parameters:data (list of lists) – list of candidate answers split into words
Returns:Padded numpy array of word indexes for all candidate answers
Return type:tuple (2d numpy array)
vectorize_stories(data)[source]

Convert (memory, user_utt, answer) word data into vectors.

If sentence length < max_utt_len, it is padded with 0’s. If memory length < memory_size, it is padded with empty memories (max_utt_len 0’s).

Parameters:data (tuple) – Tuple of memories, user_utt, answer word data.
Returns:Tuple of memories, memory_lengths, user_utt, answer vectors.
Return type:tuple
words_to_vector(words)[source]

Convert a list of words into vector form.

Parameters:words (list) – List of words.
Returns:Vectorized list of words.
Return type:list
nlp_architect.data.babi_dialog.pad_sentences(sentences, sentence_length=0, pad_val=0.0)[source]

Pad all sentences to have the same length (number of words)

nlp_architect.data.babi_dialog.pad_stories(stories, sentence_length, max_story_length, pad_val=0.0)[source]

Pad all stories to have the same number of sentences (max_story_length).
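
A minimal sketch of loading a bAbI-dialog task with this class; the path is a placeholder, and the 'train' key into data_dict is an assumption based on the attribute description above:

    from nlp_architect.data.babi_dialog import BABI_Dialog

    # Fetches and vectorizes task 1, adding time and speaker tags to each memory
    babi = BABI_Dialog(path='babi_dialog_data', task=1, oov=False,
                       use_match_type=False, use_time=True, use_speaker_tag=True)

    print(babi.vocab_size, babi.num_cands, babi.max_cand_len)
    train_split = babi.data_dict['train']  # assumed key; data_dict holds train/val/test splits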

nlp_architect.data.conll module

class nlp_architect.data.conll.ConllEntry(eid, form, lemma, pos, cpos, feats=None, parent_id=None, relation=None, deps=None, misc=None)[source]

Bases: object

nlp_architect.data.conll.normalize(word)[source]

nlp_architect.data.fasttext_emb module

class nlp_architect.data.fasttext_emb.Dictionary(id2word, word2id, lang)[source]

Bases: object

Merges the word-to-index and index-to-word dictionaries.

Parameters:
  • id2word (dict) – index-to-word dictionary
  • word2id (dict) – word-to-index dictionary
  • lang – language of the dictionary

Usage:
  dico.index(word) – returns an index
  dico[index] – returns the word
check_valid()[source]

Check that the dictionary is valid.

index(word)[source]

Returns the index of the specified word.

class nlp_architect.data.fasttext_emb.FastTextEmb(path, language, vocab_size, emb_dim=300)[source]

Bases: object

Downloads FastText embeddings for a given language to the given path.

Parameters:
  • path (str) – Local path to copy embeddings
  • language (str) – Embeddings language
  • vocab_size (int) – Size of vocabulary

Returns:A dictionary and reverse dictionary, and a numpy array of embeddings with shape emb_size x vocab_size
load_embeddings()[source]
read_embeddings(filepath)[source]
nlp_architect.data.fasttext_emb.get_eval_data(eval_path, src_lang, tgt_lang)[source]

Downloads evaluation cross-lingual dictionaries to eval_path.

Parameters:
  • eval_path – Path where cross-lingual dictionaries are downloaded
  • src_lang – Source language
  • tgt_lang – Target language

Returns:Path to where cross lingual dictionaries are downloaded
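
A minimal sketch of downloading and loading embeddings; the path is a placeholder, and unpacking load_embeddings() into a Dictionary object plus an embedding matrix is an assumption based on the Returns note above:

    from nlp_architect.data.fasttext_emb import FastTextEmb

    # Download English FastText vectors into a local directory ('fasttext_data' is a placeholder)
    fasttext = FastTextEmb('fasttext_data', 'en', vocab_size=200000)
    dico, embeddings = fasttext.load_embeddings()  # assumed return layout (dictionary + matrix)

    word_idx = dico.index('hello')  # Dictionary.index(), documented above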

nlp_architect.data.glue_tasks module

class nlp_architect.data.glue_tasks.ColaProcessor[source]

Bases: nlp_architect.data.utils.DataProcessor

Processor for the CoLA data set (GLUE version).

get_dev_examples(data_dir)[source]

Gets a collection of `InputExample`s for the dev set.

get_labels()[source]

Gets the list of labels for this data set.

get_test_examples(data_dir)[source]

Gets a collection of `InputExample`s for the test set.

get_train_examples(data_dir)[source]

Gets a collection of `InputExample`s for the train set.

class nlp_architect.data.glue_tasks.InputFeatures(input_ids, input_mask, segment_ids, label_id, valid_ids=None)[source]

Bases: object

A single set of features of data.

class nlp_architect.data.glue_tasks.MnliMismatchedProcessor[source]

Bases: nlp_architect.data.glue_tasks.MnliProcessor

Processor for the MultiNLI Mismatched data set (GLUE version).

get_dev_examples(data_dir)[source]

Gets a collection of `InputExample`s for the dev set.

get_test_examples(data_dir)[source]

Gets a collection of `InputExample`s for the test set.

class nlp_architect.data.glue_tasks.MnliProcessor[source]

Bases: nlp_architect.data.utils.DataProcessor

Processor for the MultiNLI data set (GLUE version).

get_dev_examples(data_dir)[source]

Gets a collection of `InputExample`s for the dev set.

get_labels()[source]

Gets the list of labels for this data set.

get_test_examples(data_dir)[source]

Gets a collection of `InputExample`s for the test set.

get_train_examples(data_dir)[source]

Gets a collection of `InputExample`s for the train set.

class nlp_architect.data.glue_tasks.MrpcProcessor[source]

Bases: nlp_architect.data.utils.DataProcessor

Processor for the MRPC data set (GLUE version).

get_dev_examples(data_dir)[source]

Gets a collection of `InputExample`s for the dev set.

get_labels()[source]

Gets the list of labels for this data set.

get_test_examples(data_dir)[source]

Gets a collection of `InputExample`s for the test set.

get_train_examples(data_dir)[source]

Gets a collection of `InputExample`s for the train set.

class nlp_architect.data.glue_tasks.QnliProcessor[source]

Bases: nlp_architect.data.utils.DataProcessor

Processor for the QNLI data set (GLUE version).

get_dev_examples(data_dir)[source]

Gets a collection of `InputExample`s for the dev set.

get_labels()[source]

Gets the list of labels for this data set.

get_test_examples(data_dir)[source]

Gets a collection of `InputExample`s for the test set.

get_train_examples(data_dir)[source]

Gets a collection of `InputExample`s for the train set.

class nlp_architect.data.glue_tasks.QqpProcessor[source]

Bases: nlp_architect.data.utils.DataProcessor

Processor for the QQP data set (GLUE version).

get_dev_examples(data_dir)[source]

Gets a collection of `InputExample`s for the dev set.

get_labels()[source]

Gets the list of labels for this data set.

get_test_examples(data_dir)[source]

Gets a collection of `InputExample`s for the test set.

get_train_examples(data_dir)[source]

Gets a collection of `InputExample`s for the train set.

class nlp_architect.data.glue_tasks.RteProcessor[source]

Bases: nlp_architect.data.utils.DataProcessor

Processor for the RTE data set (GLUE version).

get_dev_examples(data_dir)[source]

Gets a collection of `InputExample`s for the dev set.

get_labels()[source]

Gets the list of labels for this data set.

get_test_examples(data_dir)[source]

Gets a collection of `InputExample`s for the test set.

get_train_examples(data_dir)[source]

Gets a collection of `InputExample`s for the train set.

class nlp_architect.data.glue_tasks.Sst2Processor[source]

Bases: nlp_architect.data.utils.DataProcessor

Processor for the SST-2 data set (GLUE version).

get_dev_examples(data_dir)[source]

Gets a collection of `InputExample`s for the dev set.

get_labels()[source]

Gets the list of labels for this data set.

get_test_examples(data_dir)[source]

Gets a collection of `InputExample`s for the test set.

get_train_examples(data_dir)[source]

Gets a collection of `InputExample`s for the train set.

class nlp_architect.data.glue_tasks.StsbProcessor[source]

Bases: nlp_architect.data.utils.DataProcessor

Processor for the STS-B data set (GLUE version).

get_dev_examples(data_dir)[source]

Gets a collection of `InputExample`s for the dev set.

get_labels()[source]

Gets the list of labels for this data set.

get_test_examples(data_dir)[source]

Gets a collection of `InputExample`s for the test set.

get_train_examples(data_dir)[source]

Gets a collection of `InputExample`s for the train set.

class nlp_architect.data.glue_tasks.WnliProcessor[source]

Bases: nlp_architect.data.utils.DataProcessor

Processor for the WNLI data set (GLUE version).

get_dev_examples(data_dir)[source]

Gets a collection of `InputExample`s for the dev set.

get_labels()[source]

Gets the list of labels for this data set.

get_test_examples(data_dir)[source]

Gets a collection of `InputExample`s for the test set.

get_train_examples(data_dir)[source]

Gets a collection of `InputExample`s for the train set.

nlp_architect.data.glue_tasks.convert_examples_to_features(examples, label_list, max_seq_length, tokenizer, output_mode, cls_token_at_end=False, pad_on_left=False, cls_token='[CLS]', sep_token='[SEP]', pad_token=0, sequence_a_segment_id=0, sequence_b_segment_id=1, cls_token_segment_id=1, pad_token_segment_id=0, mask_padding_with_zero=True)[source]

Loads a data file into a list of `InputBatch`s.

`cls_token_at_end` defines the location of the CLS token:

  • False (Default, BERT/XLM pattern): [CLS] + A + [SEP] + B + [SEP]
  • True (XLNet/GPT pattern): A + [SEP] + B + [SEP] + [CLS]

`cls_token_segment_id` defines the segment id associated with the CLS token (0 for BERT, 2 for XLNet).

nlp_architect.data.glue_tasks.get_glue_task(task_name: str, data_dir: str = None)[source]

Return a GLUE task object.

Parameters:
  • task_name (str) – name of the GLUE task
  • data_dir (str, optional) – path to the dataset; if not provided, it is taken from the GLUE_DIR environment variable
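
A minimal sketch tying get_glue_task and convert_examples_to_features together; the task name, data path, output_mode value, and the use of a Hugging Face BERT tokenizer are assumptions for illustration:

    from transformers import BertTokenizer  # assumed compatible BERT-style tokenizer
    from nlp_architect.data.glue_tasks import convert_examples_to_features, get_glue_task

    task = get_glue_task('mrpc', data_dir='glue_data/MRPC')  # placeholder task name and path
    examples = task.get_train_examples()
    label_list = task.get_labels()

    tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
    features = convert_examples_to_features(examples, label_list, max_seq_length=128,
                                            tokenizer=tokenizer, output_mode='classification')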

nlp_architect.data.intent_datasets module

class nlp_architect.data.intent_datasets.IntentDataset(sentence_length=50, word_length=12)[source]

Bases: object

Intent extraction dataset base class

Parameters:sentence_length (int) – max sentence length
char_vocab

word character vocabulary

Type:dict
char_vocab_size

char vocabulary size

Type:int
intent_size

intent label vocabulary size

Type:int
intents_vocab

intent labels vocabulary

Type:dict
label_vocab_size

label vocabulary size

Type:int
tags_vocab

labels vocabulary

Type:dict
test_set

test set

Type:tuple of numpy.ndarray
train_set

train set

Type:tuple of numpy.ndarray
word_vocab

tokens vocabulary

Type:dict
word_vocab_size

vocabulary size

Type:int
class nlp_architect.data.intent_datasets.SNIPS(path, sentence_length=30, word_length=12)[source]

Bases: nlp_architect.data.intent_datasets.IntentDataset

SNIPS dataset class

Parameters:
  • path (str) – dataset path
  • sentence_length (int, optional) – max sentence length
  • word_length (int, optional) – max word length
files = ['train', 'test']
test_files = ['AddToPlaylist/validate_AddToPlaylist.json', 'BookRestaurant/validate_BookRestaurant.json', 'GetWeather/validate_GetWeather.json', 'PlayMusic/validate_PlayMusic.json', 'RateBook/validate_RateBook.json', 'SearchCreativeWork/validate_SearchCreativeWork.json', 'SearchScreeningEvent/validate_SearchScreeningEvent.json']
train_files = ['AddToPlaylist/train_AddToPlaylist_full.json', 'BookRestaurant/train_BookRestaurant_full.json', 'GetWeather/train_GetWeather_full.json', 'PlayMusic/train_PlayMusic_full.json', 'RateBook/train_RateBook_full.json', 'SearchCreativeWork/train_SearchCreativeWork_full.json', 'SearchScreeningEvent/train_SearchScreeningEvent_full.json']
class nlp_architect.data.intent_datasets.TabularIntentDataset(train_file, test_file, sentence_length=30, word_length=12)[source]

Bases: nlp_architect.data.intent_datasets.IntentDataset

Tabular Intent/Slot tags dataset loader. Compatible with many sequence tagging datasets (ATIS, CoNLL, etc.). Data must be in tabular format where:

  • one word per line, with tag annotation and intent type separated by tabs: <token> <tag_label> <intent>
  • sentences are separated by an empty line
Parameters:
  • train_file (str) – path to train set file
  • test_file (str) – path to test set file
  • sentence_length (int) – max sentence length
  • word_length (int) – max word length
files = ['train', 'test']
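
A minimal sketch of both loaders; all paths are placeholders:

    from nlp_architect.data.intent_datasets import SNIPS, TabularIntentDataset

    # SNIPS expects the directory layout listed in train_files/test_files above
    snips = SNIPS(path='snips_data', sentence_length=30, word_length=12)

    # Tabular loader: one "<token>\t<tag_label>\t<intent>" line per word,
    # with an empty line between sentences
    tabular = TabularIntentDataset('atis/train.txt', 'atis/test.txt',
                                   sentence_length=30, word_length=12)
    train_set = tabular.train_set  # tuple of numpy.ndarray, per the base-class attributes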

nlp_architect.data.ptb module

Data loader for the Penn Treebank dataset

class nlp_architect.data.ptb.PTBDataLoader(word_dict, seq_len=100, data_dir='/Users/pizsak/data', dataset='WikiText-103', batch_size=32, skip=30, split_type='train', loop=True)[source]

Bases: object

Class that defines the data loader

decode_line(tokens)[source]

Decode a given line from indexes to words.

Parameters:tokens – List of indexes
Returns:str, a sentence
get_batch()[source]

Get one batch of the data.

Returns:None

load_series(path)[source]

Load all the data into an array.

Parameters:path (str) – location of the input data file

reset()[source]

Resets the sample count to zero and re-shuffles the data.

Returns:None

class nlp_architect.data.ptb.PTBDictionary(data_dir='/Users/pizsak/data', dataset='WikiText-103')[source]

Bases: object

Class for generating a dictionary of all words in the PTB corpus

add_word(word)[source]

Add a single word to the dictionary.

Parameters:word (str) – word to be added
Returns:None
load_dictionary()[source]

Populate the corpus with words from the train, test, and valid splits of the data.

Returns:None

save_dictionary()[source]

Save the dictionary to file.

Returns:None
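
A minimal sketch of building the dictionary and streaming batches; the data directory is a placeholder, and passing the PTBDictionary instance as word_dict is an assumption based on the signatures above:

    from nlp_architect.data.ptb import PTBDataLoader, PTBDictionary

    ptb_dict = PTBDictionary(data_dir='data', dataset='WikiText-103')  # placeholder directory
    loader = PTBDataLoader(ptb_dict, seq_len=100, data_dir='data', dataset='WikiText-103',
                           batch_size=32, split_type='train', loop=True)

    loader.get_batch()  # fetch one batch of data
    loader.reset()      # reset the sample count and re-shuffle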

nlp_architect.data.sequence_classification module

class nlp_architect.data.sequence_classification.SequenceClsInputExample(guid: str, text: str, text_b: str = None, label: str = None)[source]

Bases: nlp_architect.data.utils.InputExample

A single training/test example for simple sequence classification.
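
An illustrative construction (all field values are made up for demonstration):

    from nlp_architect.data.sequence_classification import SequenceClsInputExample

    example = SequenceClsInputExample(guid='train-1',
                                      text='the movie was surprisingly good',
                                      label='positive')  # placeholder guid, text, and label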

nlp_architect.data.sequential_tagging module

class nlp_architect.data.sequential_tagging.CONLL2000(data_path, sentence_length=None, max_word_length=None, extract_chars=False, lowercase=True)[source]

Bases: object

CONLL 2000 POS/chunking task data set (numpy)

Parameters:
  • data_path (str) – directory containing CONLL2000 files
  • sentence_length (int, optional) – number of time steps to embed the data. None value will not truncate vectors
  • max_word_length (int, optional) – max word length in characters. None value will not truncate vectors
  • extract_chars (boolean, optional) – Yield Char RNN features.
  • lowercase (bool, optional) – lower case sentence words
char_vocab

character Vocabulary

chunk_vocab

chunk label Vocabulary

dataset_files = {'test': 'test.txt', 'train': 'train.txt'}
pos_vocab

pos label Vocabulary

test_set

get the test set

train_set

get the train set

word_vocab

word Vocabulary
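
A minimal usage sketch; the data directory is a placeholder and must contain the train.txt/test.txt files listed in dataset_files:

    from nlp_architect.data.sequential_tagging import CONLL2000

    conll = CONLL2000('conll2000_data', sentence_length=50, max_word_length=20,
                      extract_chars=True, lowercase=True)
    train = conll.train_set  # see the train_set/test_set attributes above
    test = conll.test_set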

class nlp_architect.data.sequential_tagging.SequentialTaggingDataset(train_file, test_file, max_sentence_length=30, max_word_length=20, tag_field_no=2)[source]

Bases: object

Sequential tagging dataset loader. Loads train/test files with tabular separation.

Parameters:
  • train_file (str) – path to train file
  • test_file (str) – path to test file
  • max_sentence_length (int, optional) – max sentence length
  • max_word_length (int, optional) – max word length
  • tag_field_no (int, optional) – index of column to use a y-samples
char_vocab

characters vocabulary

char_vocab_size

character vocabulary size

test_set

Get the test set

train_set

Get the train set

word_vocab

words vocabulary

word_vocab_size

word vocabulary size

y_labels

return y labels

class nlp_architect.data.sequential_tagging.TokenClsInputExample(guid: str, text: str, tokens: List[str], label: List[str] = None)[source]

Bases: nlp_architect.data.utils.InputExample

A single training/test example for simple sequence token classification.

class nlp_architect.data.sequential_tagging.TokenClsProcessor(data_dir, tag_col: int = -1)[source]

Bases: nlp_architect.data.utils.DataProcessor

Sequence token classification processor/dataset loader. Loads a directory containing train.txt/test.txt/dev.txt files in tab-separated format (one token per line, CoNLL style). The label dictionary is given in a labels.txt file.

get_dev_examples()[source]

Gets a collection of `InputExample`s for the dev set.

get_labels()[source]

Gets the list of labels for this data set.

static get_labels_filename()[source]
get_test_examples()[source]

Gets a collection of `InputExample`s for the test set.

get_train_examples()[source]

Gets a collection of `InputExample`s for the train set.

get_vocabulary()[source]
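
A minimal sketch of both loaders; file and directory paths are placeholders:

    from nlp_architect.data.sequential_tagging import (SequentialTaggingDataset,
                                                       TokenClsProcessor)

    # Tab-separated train/test files, tag taken from column 2
    dataset = SequentialTaggingDataset('ner/train.txt', 'ner/test.txt',
                                       max_sentence_length=30, max_word_length=20,
                                       tag_field_no=2)
    print(dataset.word_vocab_size, dataset.y_labels)

    # Directory-based processor expecting train.txt/dev.txt/test.txt and labels.txt
    processor = TokenClsProcessor('ner_data', tag_col=-1)
    train_examples = processor.get_train_examples()
    labels = processor.get_labels()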

nlp_architect.data.utils module

class nlp_architect.data.utils.DataProcessor[source]

Bases: object

Base class for data converters for sequence/token classification data sets.

get_dev_examples(data_dir)[source]

Gets a collection of `InputExample`s for the dev set.

get_labels()[source]

Gets the list of labels for this data set.

get_test_examples(data_dir)[source]

Gets a collection of `InputExample`s for the test set.

get_train_examples(data_dir)[source]

Gets a collection of `InputExample`s for the train set.

class nlp_architect.data.utils.InputExample(guid: str, text, label=None)[source]

Bases: abc.ABC

Base class for a single training/dev/test example

class nlp_architect.data.utils.Task(name: str, processor: nlp_architect.data.utils.DataProcessor, data_dir: str, task_type: str)[source]

Bases: object

A task definition class.

Parameters:
  • name (str) – the name of the task
  • processor (DataProcessor) – a DataProcessor class containing a dataset loader
  • data_dir (str) – path to the data source
  • task_type (str) – the task type (classification/regression/tagging)

get_dev_examples()[source]
get_labels()[source]
get_split_train_examples(labeled: int, unlabeled: int)[source]

Split the train set into two subsets (with sizes given by the inputs) to be used as labeled and unlabeled sets for semi-supervision tasks.

get_test_examples()[source]
get_train_examples()[source]
nlp_architect.data.utils.read_column_tagged_file(filename: str, tag_col: int = -1)[source]

Reads a column-tagged (CoNLL-style) file (tab separated, one token per line). tag_col is the column number to use as the token’s tag (defaults to the last column in the line).

Return format: [[‘token’, ‘TAG’], [‘token’, ‘TAG2’], …]

nlp_architect.data.utils.read_tsv(input_file, quotechar=None)[source]

Reads a tab separated value file.

nlp_architect.data.utils.sample_label_unlabeled(samples: List[nlp_architect.data.utils.InputExample], no_labeled: int, no_unlabeled: int)[source]

Randomly sample 2 sets of samples from a given collection of InputExamples (used for semi-supervised models)

nlp_architect.data.utils.write_column_tagged_file()[source]
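
A minimal sketch of the file readers; file names are placeholders:

    from nlp_architect.data.utils import read_column_tagged_file, read_tsv

    # CoNLL-style file: tab separated, one token per line, last column used as the tag
    tagged = read_column_tagged_file('train.txt')
    # tagged follows the [['token', 'TAG'], ...] format described above

    rows = read_tsv('dev.tsv')  # plain tab-separated-values reader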

Module contents